SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
In this paper, we introduce SoccerNet, a benchmark for action spotting in
soccer videos. The dataset is composed of 500 complete soccer games from six
main European leagues, covering three seasons from 2014 to 2017 and a total
duration of 764 hours. A total of 6,637 temporal annotations are automatically
parsed from online match reports at a one-minute resolution for three main
classes of events (Goal, Yellow/Red Card, and Substitution). As such, the
dataset is easily scalable. These annotations are manually refined to a
one-second resolution by anchoring them at a single timestamp following
well-defined soccer rules. With an average of one event every 6.9 minutes, this
dataset focuses on the problem of localizing very sparse events within long
videos. We define the task of spotting as finding the anchors of soccer events
in a video. Making use of recent developments in the realm of generic action
recognition and detection in video, we provide strong baselines for detecting
soccer events. We show that our best model for classifying temporal segments of
length one minute reaches a mean Average Precision (mAP) of 67.8%. For the
spotting task, our baseline reaches an Average-mAP of 49.7% for tolerances
ranging from 5 to 60 seconds. Our dataset and models are available at
https://silviogiancola.github.io/SoccerNet.
Comment: CVPR Workshop on Computer Vision in Sports, 2018
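The spotting metric rewards predictions that land within a temporal tolerance of a ground-truth anchor. Below is a minimal Python sketch of that matching step, assuming a greedy one-to-one assignment; the full Average-mAP additionally ranks predictions by confidence and averages precision over classes, so the function here is illustrative rather than the official SoccerNet evaluation code.

```python
import numpy as np

def spotting_recall(pred_times, gt_times, tolerance):
    """Fraction of ground-truth anchors (in seconds) hit by a prediction
    within +/- tolerance; each ground truth is matched at most once."""
    matched = set()
    for p in sorted(pred_times):
        candidates = [i for i, g in enumerate(gt_times)
                      if i not in matched and abs(p - g) <= tolerance]
        if candidates:
            matched.add(min(candidates, key=lambda i: abs(p - gt_times[i])))
    return len(matched) / max(len(gt_times), 1)

# Average the score over tolerances from 5 s to 60 s, as in the paper.
preds, gts = [12.0, 95.0, 400.0], [10.0, 100.0, 410.0]
tolerances = np.arange(5, 65, 5)
avg_score = np.mean([spotting_recall(preds, gts, t) for t in tolerances])
```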
Integration of Absolute Orientation Measurements in the KinectFusion Reconstruction pipeline
In this paper, we show how absolute orientation measurements provided by
low-cost but high-fidelity IMU sensors can be integrated into the KinectFusion
pipeline. We show that this integration improves the runtime, robustness, and
quality of the 3D reconstruction. In particular, we use this orientation data
to seed and regularize the ICP registration technique. We also present a
technique to filter the pairs of 3D matched points based on the distribution of
their distances. This filter is implemented efficiently on the GPU. Estimating
the distribution of the distances helps control the number of iterations
necessary for the convergence of the ICP algorithm. Finally, we show
experimental results that highlight improvements in robustness, a speed-up of
almost 12%, and a gain in tracking quality of 53% for the ATE metric on the
Freiburg benchmark.
Comment: CVPR Workshop on Visual Odometry and Computer Vision Applications Based on Location Clues, 2018
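The distance-based pair filter can be pictured as outlier rejection on ICP residuals. A minimal NumPy sketch follows, assuming a simple k-sigma rule as a stand-in for the paper's distribution-based filter; the actual implementation runs on the GPU, and the same distribution estimate is what controls the number of ICP iterations.

```python
import numpy as np

def filter_matched_pairs(src, dst, k=2.0):
    """Reject matched 3D point pairs whose residual distance is an
    outlier under the empirical distance distribution (k-sigma rule).
    src, dst: (N, 3) arrays of corresponding points."""
    d = np.linalg.norm(src - dst, axis=1)
    keep = np.abs(d - d.mean()) <= k * d.std()
    return src[keep], dst[keep]
```

Tracking the mean residual across iterations also gives a natural stopping signal: once it stops shrinking, the ICP loop can terminate early.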
A metrological characterization of the Kinect V2 time-of-flight camera
A metrological characterization process for time-of-flight (TOF) cameras is proposed in this paper and applied to the Microsoft Kinect V2. Based on the Guide to the Expression of Uncertainty in Measurement (GUM), the uncertainty of a three-dimensional (3D) scene reconstruction is analysed. In particular, the random and the systematic components of the uncertainty are evaluated for the single sensor pixel and for the complete depth camera. The manufacturer declares an uncertainty in the measurement of the central pixel of the sensor of about a few millimetres (Kinect for Windows Features, 2015), which is considerably better than the first version of the Microsoft Kinect (Chow et al., 2012 [1]). This work points out that performance is highly influenced by the measuring conditions and the environmental parameters of the scene; in fact, the 3D point reconstruction uncertainty can vary from 1.5 mm to tens of millimetres.
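The decomposition into random and systematic components can be illustrated with repeated acquisitions of a static scene: the per-pixel spread over frames estimates the random part, while the bias of the temporal mean against a reference depth estimates the systematic part. A minimal sketch under those assumptions (a full GUM analysis propagates further uncertainty sources):

```python
import numpy as np

def depth_uncertainty(frames, reference):
    """Per-pixel uncertainty components from M depth maps of a static scene.
    frames:    (M, H, W) repeated depth measurements in millimetres
    reference: (H, W) reference depth, e.g. from a calibrated target
    Returns (random, systematic): temporal std and bias of the mean."""
    random_u = frames.std(axis=0, ddof=1)
    systematic_u = frames.mean(axis=0) - reference
    return random_u, systematic_u
```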
MVTN: Learning Multi-View Transformations for 3D Understanding
Multi-view projection techniques have proven highly effective at achieving
state-of-the-art results in 3D shape recognition. These methods learn how to
combine information from multiple view-points.
However, the camera view-points from which these views are obtained are often
fixed for all shapes. To overcome the static nature of current multi-view
techniques, we propose learning these view-points. Specifically, we introduce
the Multi-View Transformation Network (MVTN), which uses differentiable
rendering to determine optimal view-points for 3D shape recognition. As a
result, MVTN can be trained end-to-end with any multi-view network for 3D shape
classification. We integrate MVTN into a novel adaptive multi-view pipeline
that is capable of rendering both 3D meshes and point clouds. Our approach
demonstrates state-of-the-art performance in 3D classification and shape
retrieval on several benchmarks (ModelNet40, ScanObjectNN, ShapeNet Core55).
Further analysis indicates that our approach exhibits improved robustness to
occlusion compared to other methods. We also investigate additional aspects of
MVTN, such as 2D pretraining and its use for segmentation. To support further
research in this area, we have released MVTorch, a PyTorch library for 3D
understanding and generation using multi-view projections.
Comment: Under review; journal extension of the ICCV 2021 paper, arXiv:2011.13244
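At the heart of MVTN is a small regressor mapping a coarse global shape feature to per-shape camera poses, which drive a differentiable renderer so that the classifier's gradients reach the view-points. A minimal PyTorch sketch of such a regressor, with illustrative names and sizes (rendering and the multi-view backbone are omitted):

```python
import torch
import torch.nn as nn

class ViewPointRegressor(nn.Module):
    """Predict bounded (azimuth, elevation) offsets for n_views cameras
    from a global shape feature, making the view-points learnable."""
    def __init__(self, feat_dim=256, n_views=8, max_offset_deg=90.0):
        super().__init__()
        self.n_views, self.max_offset = n_views, max_offset_deg
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 2 * n_views),  # 2 angles per view
        )

    def forward(self, shape_feat):                 # (B, feat_dim)
        offsets = torch.tanh(self.mlp(shape_feat)) * self.max_offset
        return offsets.view(-1, self.n_views, 2)   # (B, n_views, 2)
```

Because the renderer is differentiable, the classification loss back-propagates through the rendered views into these angle offsets, which is what makes end-to-end training with any multi-view network possible.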
Learning Semantic Segmentation with Query Points Supervision on Aerial Images
Semantic segmentation is crucial in remote sensing, where high-resolution
satellite images are segmented into meaningful regions. Recent advancements in
deep learning have significantly improved satellite image segmentation.
However, most of these methods are typically trained in fully supervised
settings that require high-quality pixel-level annotations, which are expensive
and time-consuming to obtain. In this work, we present a weakly supervised
learning approach that trains semantic segmentation models using only query
point annotations instead of full mask labels. Our proposed approach
performs accurate semantic segmentation and improves efficiency by
significantly reducing the cost and time required for manual annotation.
Specifically, we generate superpixels and extend the query point labels to the
superpixels that group semantically similar regions. We then train semantic
segmentation models supervised with partially labeled images carrying these
superpixel pseudo-labels. We benchmark our weakly supervised training approach
on an aerial image dataset and different semantic segmentation architectures,
showing that we can reach competitive performance compared to fully supervised
training while reducing the annotation effort.
Comment: Paper presented at the LXCV workshop at ICCV 2023
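The label-expansion step can be sketched with off-the-shelf superpixels: every superpixel containing a query point inherits that point's class, and all remaining pixels are ignored by the loss. A minimal sketch assuming SLIC superpixels (a stand-in for whichever superpixel method is used; the parameters are illustrative):

```python
import numpy as np
from skimage.segmentation import slic

IGNORE = 255  # ignore index for the segmentation loss

def expand_point_labels(image, points, n_segments=500):
    """Extend sparse query-point labels to superpixel pseudo-labels.
    image:  (H, W, 3) array
    points: iterable of (row, col, class_id) annotations
    Returns an (H, W) pseudo-label map; unlabeled pixels keep IGNORE."""
    segments = slic(image, n_segments=n_segments, compactness=10)
    labels = np.full(image.shape[:2], IGNORE, dtype=np.int64)
    for r, c, cls in points:
        labels[segments == segments[r, c]] = cls
    return labels
```

Training then uses a standard cross-entropy loss with ignore_index=IGNORE, so only the pseudo-labeled pixels contribute to the supervision.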